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Abstract—Breast ultrasound (BUS) image segmentation is chal- 
lenging and critical for BUS Computer-Aided Diagnosis (CAD) 
systems. Many BUS segmentation approaches have been proposed 
in the last two decades, but the performances of most approaches 
have been assessed using relatively small private datasets with dif- 
ferent quantitative metrics, which result in discrepancy in 
performance comparison. Therefore, there is a pressing need for 
building a benchmark to compare existing methods using a public 
dataset objectively, and to determine the performance of the best 
breast tumor segmentation algorithm available today and to inves- 
tigate what segmentation strategies are valuable in clinical prac- 
tice and theoretical study. In this work, we will publish a B-mode 
BUS image segmentation benchmark (BUSIS) with 562 images 
and compare the performance of five state-of-the-art BUS segmen- 
tation methods quantitatively. 


Index Terms—Breast ultrasound (BUS) images, segmentation, 
computer-aided diagnosis (CAD) system, benchmark. 


I. INTRODUCTION 


REAST cancer occurs in the highest frequency in women 
among all cancers, and is also one of the leading cause of 
cancer death worldwide [1]. The key to reduce the mortality is 


to find the signs and symptoms of breast cancer at its early stage. 


In current clinic practice, breast ultrasound (BUS) imaging with 
computer-aided diagnosis (CAD) system has become one of the 
most important and effective approaches for breast cancer de- 
tection due to its noninvasive, painless, non-radioactive and 
cost-effective nature. In addition, it is the most suitable ap- 
proach for large-scale breast cancer screening and diagnosis in 
low-resource countries and regions. 

CAD systems based on B-mode breast ultrasound (BUS) 
have been developed to overcome the inter- and intra-variabili- 
ties of the radiologists’ diagnoses, and have demonstrated the 
ability to improve the diagnosis performance of breast cancer 
[2]. Automatic BUS segmentation, extracting tumor region 
from normal tissue regions of BUS image, is a crucial compo- 
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nent ina BUS CAD system. It can change the traditional sub- 
jective tumor assessments into operator independent, reproduc- 
ible and accurate tumor region measurements. 

Automatic BUS image segmentation study attracted great at- 
tention in the last two decades due to clinical demands and its 
challenging nature, and generated many automatic segmenta- 
tion algorithms. We can classify existing approaches into semi- 
automatic and fully automatic according to “with or without” 
user interactions in the segmentation process. In most semi-au- 
tomatic methods, user needs to specify a region of interest (ROI) 
containing the lesion, a seed in the lesion, or an initial boundary. 
Fully automatic segmentation is usually considered as a top- 
down framework which models the knowledge of breast ultra- 
sound and oncology as prior constraints, and needs no user in- 
tervention at all. However, it is quite challenging to develop au- 
tomatic tumor segmentation approaches for BUS images, due 
to the low image quality caused by speckle noise, low contrast, 
weak boundary, and artifacts. Furthermore, tumor size, shape 
and echo strength vary considerably across patients, which pre- 
vent the application of strong priors to object features that are 
important for conventional segmentation methods. 

In previous works, all approaches were evaluated by using 
private datasets and different quantitative metrics (see Table 1) 
for performance measurements, which make the objective and 
effective comparisons among the methods impossible. As a 
consequence, it remains challenging to determine the best per- 
formance of the breast tumor segmentation algorithms available 
today, what segmentation strategies are valuable in clinic prac- 
tice and study, and what image features are helpful and useful 
in improving segmentation accuracy and robustness. We will 
build a BUS image segmentation benchmark which includes a 
large public BUS image dataset with ground truth, and investi- 
gate some objective and quantitative metrics for segmentation 
performance evaluation. 

In this paper, we present a BUS image segmentation bench- 
mark including 562 B-mode BUS images, and compare five 
state-of-the-art BUS segmentation methods by using seven 
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Article Type Year Categor # of images/Availabili Metrics 
Xian, et al. [3] F 2015 Graph-based 184/private TP, FP, SI, HD, MD 
Shao, et al. [18] F 2015 Graph-based 450/private TP, FP, SI 
Huang, et al.[4] Ss 2014 Graph-based 20/private ARE,TPVF, FPVF,FNVF 
Pons, et al [5] S 2014 Deformable models 163/private Sensitivity, ROC area 
Xian, et al. [6] F 2014 Graph-based 131/private SI, FP, AHE 
Kuo, et al.[7] Ss 2014 Deformable models 98/private DS 
Torbati, et al.[8] S 2014 Neural network 30/private JI 
Moon, et al. [9 S 2014 Fuzzy C-means 148/private Sensitivity and FP 
Jiang, et al.[10] Ss 2012 Adaboost 112/private Mean overlap ratio 
Shan, et al. [11] F 2012 Neural network 60/private TP, FP, FN, HD, MD 
Yang, et al.[12 S 2012 Naive Bayes classifier 33/private FP 
Shan, et al.[13] F 2012 Neutrosophic L-mean 122/private TP, FP, FN, SI, HD, and MD 
Liu, et al. [14] S 2012 Cellular automata 205/private TP, FP,FN, SI 
Hao, et al. [57] F 2012 DPM + CRF 480/private JI 
Gao, et al. [15] S 2012 Normalized cut 100/private TP, FP, SI, HD, MD 
Hao, et al. [38] F 2012 Hierarchical SVM + CRF 261/private JI 
Liu, et al. [16] S 2010 Level set-based 79/private TP, FP, SI 
Gomez, et al.[17] S 2010 Watershed 50/private Overlap ratio, NRV and PD 


Table 1. Recently published approaches. F: fully automatic, S: semi-automatic; SVM: support vector machine, CRF: conditional random field, 
DPM: deformable part model, TP: true positive, FP: false positive, SI: similarity, HD: Hausdorff distance, MD: mean distance, DS: Dice similarity, 
JI: Jaccard Index, ROC: Receiver operating characteristic, ARE: average radial error, TPVF : true positive volume fraction , FPVF : false positive 
volume fraction, FNVF: false negative volume fraction, PR: precision ratio, MR: match rate, NRV: normalized residual value, and PD: proportional 


distance. 


popular quantitative metrics. 

Wealso put the BUS dataset and the performances of the five 
approaches on http://cvprip.cs.usu.edu/busbench. To the au- 
thors’ best knowledge, this is the first attempt to benchmarking 
the BUS image segmentation methods. With the help of this 
benchmark, researchers can compare their methods with other 
algorithms, and find the primary and essential factors improv- 
ing the segmentation performance. 

The paper is organized as follows: in section II, a brief review 
of BUS image segmentation approaches is given; in section III, 
the set-up of the benchmark is presented; in section IV, the ex- 
periments and discussions are conducted; and in section V, the 
conclusion is presented. 


Il. RELATED WORK 


Many BUS segmentation approaches have been studied in 
the last two decades, and have been proved to be effective on 
their private datasets. In this section, we present a brief review 
of automatic BUS image segmentation approaches. For more 
details about BUS image segmentation approaches, refer the 
survey paper [19]. 

We classify the BUS image segmentation approaches in 
four categories: (1) deformable models, (2) graph-based ap- 
proaches, (3) learning-based approaches, and (4) classical ap- 
proaches (e.g., thresholding, region growing, and watershed). 

Deformable models (DMs): According to the ways of repre- 
senting the curves and surfaces, we can generally classify DMs 
into two subcategories: (1) the parametric DMs (PDMs) and (2) 
the geometric DMs (GDMs). 

In PDMs-based BUS image segmentation approaches, the 
main work was focused on generating good initial tumor bound- 
ary. Madabhushi et al. [20] proposed a fully automatic approach 
for BUS tumor segmentation by initializing PDMs using the 
boundary points produced in tumor localization step; and the 
balloon forces were employed in the extern forces. Chang et al. 


[21] utilized the sticks filter [22] to enhance edge and reduce 
speckle noise before using the PDMs. Huang et al. [23] pro- 
posed an automatic BUS image segmentation approach by us- 
ing the gradient vector flow (GVF) model [24], and the initial 
boundary was obtained by using the watershed approach. 

In GDMs-based BUS image segmentation approaches, many 
methods focused on dealing with the weak boundary and inho- 
mogeneity of BUS images. Gomez et al. [25] proposed a BUS 
image segmentation approach based on the active contour with- 
out edges (ACWE) model [26] which defined the stopping term 
on Mumford-Shah technique. The initial contour was a five- 
pixel radius circle centered at a point in the tumor marked by 
the user. Daoud et al. [27] built a two-fold termination criterion 
based on the signal-to-noise ratio and local intensity value. Gao 
et al. [28] proposed a level set approach based on the method in 
[29] by redefining the edge-based stop function using phase 
congruency [30] which was invariant to intensity magnitude, 
and integrated the GVF model into the level set framework. Liu 
et al. [16] proposed a GDMs-based approach which enforced 
priors of intensity distribution by calculating the probability 
density difference between the observed intensity distributions 
and the estimated Rayleigh distribution. 

Graph-based approaches: graph-based approaches gain 
popularity in BUS image segmentation because of their flexi- 
bility and efficient energy-optimization. The Markov random 
field - Maximum a posteriori - Iterated Conditional Mode 
(MRF-MAP-ICM) and the Graph cuts are the two major 
frameworks in graph-based approaches. 

MRF-MAP-ICM: Boukerroui et al. [31] stated that healthy 
and pathological breast tissues presented different textures on 
BUS images, and proposed an improved method in [32] by 
modeling both intensity and texture distributions in the 
likelihood energy; they also assumed that the texture features 
represented by using co-occurrence matrix follow the Gaussian 
distribution; and the parameters were estimated in a way similar 
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to the one in [32]. In [33], the Gaussian parameters in the like- 
lihood energy were defined globally and specified manually. 
[34] proposed a one-click user interaction to estimate Gaussian 
parameters automatically. 

Graph cuts: Xian et al. [3] proposed a fully automatic BUS 
image segmentation framework in which the graph cuts energy 
modeled the information from both the frequency and space do- 
mains. The data term (likelihood energy) modeled the tumor 
pose, position and intensity distribution. [35] built the graph on 
image regions, and initialized it by specifying a group of tumor 
regions (F’) and a group of background regions (B). The weight 
of any ¢-link was set to 0 if the node belonged to F N B, and all 
the other weights of ¢-/inks were set to 0; and the region inten- 
sity difference and edge strength discussed in [36] were applied 
to define the weight function of the smoothness term (prior en- 
ergy). [35] proposed a discriminative graph cut approach in 
which the data term was determined online by a pre-trained 
Probabilistic Boosting Tree (PBT) [37] classifier. 

In [38], hierarchical multiscale superpixel classification 
framework was proposed to define the data term. The hierar- 
chical classifier had four layers (20, 50, 200, and 800 superpix- 
els/nodes) built by using the normalized cut and k-means for 
multiscale representation; the histogram difference (Euclidean 
distance) between adjacent superpixels was used to define the 
weights in the smoothness term. 

Low optimization speed and locally optima are the two main 
drawbacks of the MRF-MAP-ICM; while the “shrinking” prob- 
lem is the main disadvantage of Graph cuts-based approaches. 

Learning based approaches: both supervised and unsuper- 
vised learning approaches have been applied to solve the BUS 
image segmentation problem. Unsupervised approaches are 
simple and fast, and commonly utilized as preprocessing to 
generate candidate image regions. Supervised approaches are 
good in integrating features at different levels, but not good in 
applying boundary constraints to generate accurate tumor 
boundary. 

Clustering: Xu et al. [39] proposed a BUS image segmenta- 
tion method applying the spatial FCM (sFCM) [40] to the local 
texture and intensity features. In sFCM, the membership value 
of each point was updated by using its neighbors’ membership 
values. In [39], the number of clusters was set as 2, and the 
membership values were assigned by using the modes of image 
histogram as the initial cluster centers. In [41], FCM was ap- 
plied to pixel intensities for generating image regions in four 
clusters; then morphology, location and size features were 
measured for each region; a linear regression model trained on 
the features was employed to produce the tumor likelihoods for 
all regions, and the region with the highest likelihood was con- 
sidered as a tumor. Moon et al. [9] applied FCM to image re- 
gions produced by using the mean shift method; the number of 
clusters was set to 4, and the regions belonging to the darkest 
cluster were extracted as the tumor candidates. Shan et al. [13] 
extended the FCM and proposed the neutrosophic l-means 
(NLM) clustering to deal with the weak boundary problem in 
BUS image segmentation; and it took the indeterminacy of 
membership into consideration. 

SVM and NN: Liu et al. [42] trained a SVM classifier using 


local image features to classify small image lattices (16 x 16) 
into the tumor or non-tumor classes; the radius basis function 
(RBF) was utilized; and 18 features (16 features from co-occur- 
rence matrix and the mean and variance of the intensities) were 
extracted from a lattice. Jiang et al. [10] trained Adaboost clas- 
sifier using 24 Haar-like features [43] to generate a set of can- 
didate tumor regions and trained SVM to determine the false 
positive and true positive regions. [44] proposed an NN-based 
method to segment 3D BUS images by processing 2D images 
slices using local image features. Othman et al. [45] trained 
two ANNs to determine the best-possible threshold. The ANN 
had 3 layers, 60 nodes in hidden layer, one node in the output 
layer. The first ANN used the Scale Invariant Feature Trans- 
form (SIFT) descriptors as the inputs; and the second employed 
the texture features from the Grey Level Co-occurrence Matrix 
(GLCM) as the inputs. [11] trained an ANN to conduct pixel- 
level classification by using the joint probability of intensity 
and texture [20] and two new features: the phase in the max- 
energy orientation (PMO) and radial distance (RD). The ANN 
had 6 hidden nodes and 1 output node. 

Deep Learning-based approaches have been reported to 
achieve state-of-the-art performance for many medical tasks 
such as prostate segmentation [46], cell tracking [47], muscle 
perimysium segmentation [48], brain tissue segmentation [49], 
breast tumor diagnosis [50], etc. Deep learning models have 
great potential to achieve good performance because of their 
ability to characterize big image variations and to learn compact 
image representation using sufficiently large BUS image da- 
taset. Deep learning architectures based on convolutional neural 
networks (CNNs) and recurrent neural networks (RNNs) are 
employed in medical image segmentation [46 - 50]. 

Classical approaches: Three most popular classical ap- 
proaches were applied to BUS image segmentation: threshold- 
ing, region growing and watershed. 

In BUS image segmentation, thresholding was often used as 
a pre-processing step for tumor localization. Region growing 
extracts image regions by starting from a set of pixels (called 
seeds) and growing seeds to large regions based on predefined 
growth criteria. In [11], Shan et al. proposed an automatic seed 
generation approach. In [52], Kwak et al. defined the cost of 
growing a region by modelling common contour smoothness 
and region similarity (mean intensity and size). 

Watershed could produce more stable results than threshold- 
ing and region growing approaches, and selecting the marker(s) 
is the key issue in watershed segmentation. Huang et al. [40] 
selected the markers based on grey level and connectivity. [54] 
applied watershed to determine the boundaries on binary image. 
The markers were set as the connected dark regions on the bi- 
nary image. [56] applied watershed and post-refinement based 
on grey level and location to generate candidate tumor regions. 

In Table 1, we list the brief information of 18 approaches 
published recently. 
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Final ground truth 


Figure 1. Ground truth generation 


Ill. BENCHMARK SETUP 


A. Approaches and Setup 


We obtained permissions from the developers of five state- 
of-the-art BUS segmentation methods [3, 11, 14, 16, 18] to use 
their codes. Approaches in [14] and [16] are interactive, and 
both need operator to specify regions of interest (ROIs) manu- 
ally; and the other three [3, 11, 18] are fully automatic. 

[16] is a level set-based segmentation approach and sets the 
initial tumor boundary by using user-specified ROI. The maxi- 
mum number of iterations is set to 450 as the stopping criterion. 
[14] is based on cell competition and uses the pixels on the 
boundary of the ROI specified by user as the background seeds 
and pixels on an adaptive cross at the ROI center as the tumor 
seeds. [11] utilizes predefined reference point (center of the up- 
per part of the image) for seed generation and pre-trained tumor 
grey-level distribution for texture feature extraction. We use the 
same reference point defined in [11] and the predefined grey- 
level distribution provided by the authors; 10-fold cross-valida- 
tion is employed to evaluate the overall segmentation perfor- 
mance. [3] and [18] are two graph-based fully automatic ap- 
proaches. In our experiments, we adopt all the parameters from 
the original papers correspondingly. 


B. Datasets and Ground Truth Generation 


Our BUS image dataset has 562 images. The images are 
collected by the Second Affiliated Hospital of Harbin Medical 
University, the Affiliated Hospital of Qingdao University, and 
the Second Hospital of Hebei Medical University using multi- 
ple ultrasound devices: GE VIVID 7 and LOGIQ E39, Hitachi 
EUB-6500, Philips 1022, and Siemens ACUSON S2000. The 


images from different resources may be valuable for testing the 
robustness of the algorithms. Informed consents to the protocol 
from all patients were acquired. The privacy of the patients is 
well protected. 

Four experienced radiologists are involved in the ground 
truth generation; three radiologists read and delineated each tu- 
mor boundary individually, and the fourth one (senior expert) 
will judge if the majority voting results need adjustment. The 
complete procedures of the ground truth generation are as fol- 
lows. 

Step I: every of three experienced radiologists delineates 
each tumor boundary manually, and three delineation results 
will be produced for each BUS image. 

Step 2: view all pixels inside/on the boundary as tumor re- 
gion, and outside pixels as background; conduct majority vot- 
ing to generate the preliminary result for each BUS image. 

Step 3: a senior expert will read each BUS image and refer 
its corresponding preliminary result to decide if it needs any ad- 
justment. 

Step 4: label tumor pixel as 1 and background pixel as 0; and 
generate a binary and uncompressed image to save the ground 
truth for each BUS image. 

An example of the ground truth generation is in Figure 1. 


C. Quantitative Metrics 


Among the five approaches, two of them [14, 16] are semi- 
automatic and user predefined ROI needs to be set before the 
segmentation; while the other three approaches [3, 11, 18] are 
fully automatic. The performance of semi-automatic ap- 
proaches may vary with different user interactions. It is difficult 
and meaningless to compare semi-automatic methods with fully 
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Figure 2. An example of 4 ROIs with LRs of 1.3, 1.5, 1.7 and 1.9, respectively. 


automatic methods; therefore, we will compare the methods in 
two categories separately. 

In the evaluation of semi-automatic approaches, we compare 
the segmentation performances of the two methods using the 
same set of ROIs, and evaluate the sensitivity of the methods to 
ROIs with different looseness ratio (LR) defined by 


where BD, is the size of the bounding box of the ground truth 
and is used as the baseline, and BD is the size of a ROI contain- 
ing BD». We produce 10 groups of ROIs with different LRs au- 
tomatically using the approach described in [55]: move the four 
sides of a ROI toward the image borders to increase the loose- 
ness ratio; and the amount of the move is proportional to the 
margin between the side and the image border. The LR of the 
first group is 1.1; and the LR of each of the other groups is 20% 
larger than that of its previous group. A BUS image and its four 
ROIs with different LRs are shown in Figure 2. 

The method in [11] is fully automatic, involves neural net- 
work training and testing, and a 10-fold cross validation strat- 
egy is utilized to evaluate its performance. Methods in [3, 18] 
need no training and operator interaction. All experiments are 
performed using a windows-based PC equipped with a dual- 
core (2.6 GHz) processor and 8 GB memory. The performances 
of these methods are validated by comparing the results with 
the ground truths. 

Both area and boundary error metrics are employed to assess 
the performance of the five approaches. The area error metrics 
include the true positive ratio (TPR), false positive ratio (FPR), 
Jaccard index (JI), Dice’s coefficient (DSC), and area error ratio 
(AER) 


|Am 9 A,| 
TPR = (1) 
lAm| 
[Am UA, — Aml 
FPR = (2) 
lAm| 
AmNA 
i lAm N A,| (3) 
[Am U A,| 
2|Am N A,| 
DSC = ——~—— (4) 
|Am| + |Ar| 
Am UA,| — |4m A 
ae! mV A,|— |Am N4,| (5) 
lAm| 


where A,, is the pixel set of the tumor region of the ground truth, 


A, is the pixel set of the tumor region generated by a segmenta- 
tion method, and | - | indicates the number of elements of a set. 
TPR, FPR and AER take values in [0, 1]; and FPR could be 
greater than 1 and takes value in [0, +00). Furthermore, 
Hausdorf error (HE) and mean absolute error (MAE) are used 
to measure the worst possible disagreement and the average 
agreement between two boundaries, respectively. Let C,,and C, 
be the boundaries of tumor in the ground truth and the segmen- 
tation result, respectively. The HE is defined by 


HE(Cyn, C,) = max{max{d(x, CJ}, maxid(y, Cm)}} (6) 


where x and y are the points on boundaries C,, and C,, respec- 
tively; d(-, C) is the distance between a point and a boundary C 
as 


d(z,C) = min{||z —k\} 


where ||z — k|| is the Euclidean distance between points z and 
k; and d(z, C) is the minimum distance between point z and all 
points on C. 

MAE is defined by 


MAE(Cym,C,) = 1/2 ». ee > oO, En) (7) 


xECm yeCy 

In Eq. (7), n, and 1,, are the numbers of points on bounda- 
ries C, and C,,, respectively. 

The seven metrics above were discussed in [19]. For the first 
two metrics (TPR and FPR), each of them only measures a 
certain aspect of the segmentation result, and is not suitable for 
describing the overall performance; e.g., a high TPR value 
indicates that most portion of the tumor region is in the 
segmentation result; however, it cannot claim an accurate 
segmentation because it does not measure the ratio of correctly 
segmented non-tumor regions. The other five metrics (JI, DSC, 
AER, HE and MAE) are more comprehensive and effective to 
measure the overall performance of segmentation approaches, 
and are commonly applied to tune the parameters of 
segmentation models [3], e.g., large JI and DSC and small AER, 
HE and MAE values indicate the high overall segmentation 
performance. 

Although JI, DSC, AER, HE and MAE are comprehensive 
metrics, we still recommend using both TPR and FPR for eval- 
uating BUS image segmentation; since with these two metrics, 
we can discover some hidden characteristics that cannot be 
found through the comprehensive metrics. Suppose that the al- 
gorithm has low overall performance (small JI and DSC, and 
large AER, HE and MAE); if FPR and TPR are large, we can 
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Figure 3. Segmetation results of [16]. (a) average TPR, FPR, JI, DSC and AER on different ROI sets; (b) average HE and MAE. 
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Figure 4. Segmentation results of [14]. (a) average TPR, FPR, JI, DSC and AER on different ROI sets; (b) average HE and MAE. 


conclude that the algorithm has overestimated the tumor region; 
if both FPR and TPR are small, the algorithm has underesti- 
mated the tumor regions. The findings from TPR and FPR can 
guide the improvement of the BUS segmentation algorithms. 


IV. APPROACHES COMPARISON AND DISCUSSIONS 


In this section, we evaluate five state-of-the-art approaches 
[11, 14, 16, 18, 26]. For fully automatic approaches, we com- 
pare their average performances by using the seven metrics dis- 
cussed in section III-C; while for semi-automatic approaches, 
we also evaluate their sensitivities to different LRs. 


A. Semi-automatic Segmentation 


Ten ROIs are generated automatically for each BUS image, 
and LRs range from 1.1 to 2.9 (step is 0.2). Totally, 5620 ROIs 
are generated for the entire BUS dataset, and we run each of the 
semi-automatic segmentation approach 5620 times to produce 
the results. All the segmentation results on the ROIs with the 
same LR are utilized to calculate the average TRP, FPR, DSC, 
AER, HE and MAE, respectively; and the results of the ap- 
proaches in [14] and [16] are shown in Figure 3 and Figure 4, 
respectively. 

All the average TPR and DSC values of the method in [16] 
are above 0.7, and its average JI values vary in the range [0.65, 
0.75]. The average TPR values increase with the increasing LR 
values of ROIs. Both the average JI and DSC values tend to 
increase firstly, and then decrease with the increasing LRs of 
ROIs. FPR, AER and HE have low average values when the 
LRs are small, which indicate that the high performance of 


method in [16] can be achieved by using tight ROIs; however, 
the values of the three metrics increase almost linearly with the 
LRs of ROIs when the looseness is greater than 1.3; this 
observation shows that the overall performance of [16] drops 
rapidly by using large ROIs above a certain level of LR. The 
average MAE values decrease firstly, and then increase and 
vary with the LRs in a small range. Four metrics (average JI, 
DSC, AER and MAE) reach their optimal values at the LR of 
1.5 (Table 2). 

The segmentation results of the approach in [14] are 
demonstrated in Figure 4. All average JI values are between 0.7 
and 0.8; and all average DSC values are between 0.8 and 0.9. 
As the method in [16], the average TPR values of the method 
in [14] are above 0.7; and increase with LRs of ROIs; the 
average JI and DSC values increase firstly, and then decrease; 
the average FPR values increase with the increasing looseness 
of ROIs; and the average DSC, HE and MAE decrease firstly, 
and then increase. Five metrics (average JI, DSC, AER, HE and 
MAB) reach their optimal values at the LR of 1.9 (Table 2). 

As shown in Figures 3 and 4 and Table 2, the two approaches 
achieve their best performances with different LRs (1.5 and 1.9 
respectively). We can also observe the following facts: 

e [16] and [14] are quite sensitive to the sizes of ROIs; their 

performances vary greatly at different LRs. 

e Every approach achieves the best performance at a certain 

value of LR; however, not at the lowest looseness level (1.1). 


Metrics F Area error metrics Boundary error metrics Time 
Looseness Ratio 
Methods Ave. TPR | Ave.FPR | Ave.JI | Ave.DSC | Ave. AER | Ave.HE | Ave. MAE | Ave. Time (s) 
Ll 0.73 0.08 0.67 0.78 0.35 45.4 12.6 18 
1.3 0.79 0.10 0.72 0.82 0.31 42.2 10.9 22 
1.5 0.82 0.13 0.73 0.84 0.31 44.0 10.4 27 
1.7 0.83 0.17 0.73 0.83 0.33 48.3 10.9 27 
[16] 1.9 0.85 0.20 0.72 0.83 0.36 51:3 11.2 30 
2.1 0.86 0.24 0.71 0.82 0.39 54.9 11.7 30 
2.3 0.86 0.27 0.70 0.82 0.41 57.0 12.1 36 
2.5, 0.87 0.32 0.69 0.80 0.46 61.3 13.1 39 
2.7 0.87 0.35 0.68 0.79 0.48 62.1 13.4 40 
2.9 0.86 0.40 0.66 0.77 0.54 66.2 14.6 44 
1.1 0.70 0.01 0.70 0.82 0.31 35.8 11.1 487 
1.3 0.76 0.02 0.75 0.85 0.26 32.0 9.1 467 
1.5 0.79 0.03 0.77 0.87 0.23 29.9 8.1 351 
L.7 0.82 0.05 0.79 0.88 0.23 29:5 7.8 341 
[14] 1.9 0.84 0.07 0.79 0.88 0.23 29.0 7.6 336 
2.1 0.86 0.10 0.79 0.88 0.24 29.5 17 371 
2.3 0.87 0.13 0.78 0.87 0.26 31.3 8.3 343 
2.5 0.89 0.16 0.77 0.87 0.28 31.9 8.5 365 
2.7 0.90 0.20 0.75 0.85 0.31 34.1 9.2 343 
2.9 0.90 0.25 0.73 0.84 0.35 36.9 10.2 388 
Table 2. Quantitative results of [14] and [16] using 10 LRs of ROL 
F Area error metrics Boundary, SrrOn me Time 
Metrics rics 
Mretiars Ave. TPR | Ave.FPR | Ave. JI a eS Ave. HE | Ave. MAE po ae 
Fully automatic approaches 
[3] 0.81 7 
[11] 0.81 16 
[18] 0.67 0.18 0.61 0.71 0.51 69.2 21.3 20 
Semi-automatic approaches 
[16], LR=1.5 0.82 0.13 0.73 0.84 0.31 44.0 10.4 27 
[24], LR=1.9 0.84 0.07 0.79 0.88 0.23 29.0 7.6 371 


Table 3. Results of three fully automatic tumor segmentation approaches [3, 11, 18] and the average optimal performances of [14] and [16]. 


e The performances of the two approaches drop if the loose- 
ness level is greater than a certain value; and the perfor- 
mance of the method [14] drops much slower than that of 
the method in [11]. 

e Set 1.9 as the optimal LR for [14] and 1.5 for [16]; and [14] 
achieves better optimal average performance than that of 
[16]. 

e The running time of the approach in [16] is proportional to 
the size of a user specified ROI, while there is no such rela- 
tionship of the running time of the approach in [14]. 

e The running time of the approach in [14] is slower than that 
of the approach in [16] by one order of the magnitude. 


B. Fully Automatic Segmentation 


[3], [11] and [18] are three fully automatic approaches and 
their segmentation performances are shown in Table 3. The 
method in [3] achieves better performance than that of the meth- 
ods in [11] and [18] on all five comprehensive metrics and its 
average FPR is also the lowest. The method in [11] has the 
same average TPR value as the method in [3]; however, its 
average FPR value reaches 106.5% which is almost six times 
larger than that of the method in [3]; the high average FPR and 
AER values of the method in [11] indicate that a large portion 
of non-tumor regions are misclassified as tumor regions. 
Among the three fully automatic approach, [18] has the lowest 


average TPR (62.1%); however, it outperforms the method in 
[11] on all the other six metrics. 

In Table 3, we also show the average optimal performances 
of the methods in [16] and [14] at the ZRs of 1.5 and 1.9, re- 
spectively. Their performances outperform that of all the fully 
automatic approaches, and [14] achieves the best performance 
among them; but the two approaches are much slower than the 
fully automatic approaches, and operators are impossible to 
know what the best ROI sizes are beforehand which can lead to 
the best segmentation results. 


C. Discussions 


As discussed in [19], many semi-automatic segmentation ap- 
proaches are utilized for BUS image segmentation. User inter- 
actions (setting seeds and/or ROIs) are required in these 
approaches, and could be useful for segmenting BUS images 
with extremely low quality. As shown in Table 3, the two inter- 
active approaches could achieve better performances than many 
fully automatic approaches if the ROI size is set properly. 

Figures 3 and 4 also demonstrate that the two approaches 
achieve different performances using different sizes of ROIs. 
Therefore, the major issue in semi-automatic approaches is how 
to determine the best ROIs/seeds. But such issue has been ne- 
glected before; most semi-automatic approaches were only fo- 
cused on improving segmentation performance by designing 
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more complex features and segmentation models, and did not 
consider user interaction as an important factor that could affect 
the segmentation performance. Hence, we recommend re- 
searchers that they should consider this issue when they develop 
semi-automatic approaches. Two possible solutions could be 
applied to solve this issue. First, for a given approach, we could 
choose the best LR by running experiments on a given BUS im- 
age training set (like section IV-A) and apply the LR to the test 
set. Second, like the interactive segmentation approach in [55], 
we could bypass this issue by designing segmentation models 
less sensitive to user interactions. 

Fully automatic segmentation approaches have many good 
properties such as operator-independence and reproducibility. 
The key strategy that shared by many successful fully automatic 
approaches is to localize the tumor ROI by modeling the do- 
main knowledge. [11] localizes tumor ROI by formulizing the 
empirical tumor location, appearance and size; [22] generates 
tumor ROI by finding adaptive reference position; and in [18], 
the ROI is generated to detect the mammary layer of BUS im- 
age, and the segmentation algorithm only detects the tumor in 
this layer. However, in many fully automatic approaches, some 
inflexible constraints are utilized which lower their robustness; 
e.g., [11] utilizes the fixed reference position to rank the 
candidate regions in the ROI localization process, and achieves 
good performance on the private BUS dataset; however, it fails 
to localize 8 images of the benchmark. As discussed in [19], to 
avoid the degradation of segmentation performance for BUS 
images not collected under the same controlled settings, it is 
important to develop unconstrained segmentation techniques 
that are invariant to image settings. 

As shown in Table 1, many different quantitative metrics ex- 
ist for evaluating the performances of BUS image segmentation 
approaches. In this paper, we have applied seven metrics rec- 
ommended in [19] to evaluate BUS image segmentation ap- 
proaches. As shown in Figures 3 and 4, average JI, DSC and 
AER have the same trend, and each of them is sufficient to eval- 
uate the area error comprehensively. 


V. CONCLUSION 


In this paper, we establish a BUS image benchmark and pre- 
sent the comparing results of five state-of-the-art BUS segmen- 
tation approaches; two of them are semi-automatic and others 
are fully automatic. The BUS dataset contains 562 BUS images 
collected using three different ultrasound machines; therefore, 
the images have large variance in terms of image contrast, 
brightness and degree of noise, and can be valuable for testing 
the robustness of the algorithms as well. In the five approaches, 
two of them [3, 18] are graph-based approaches, [11] is ANN- 
based approach, [16] is a level set-based segmentation approach, 
and [14] is based on cell competition. 

The quantitative analysis of the considered approaches high- 
lights the following important issues. 

e By using the benchmark, no approaches in this study can 
achieve the same performances reported in their original 
papers. 

e The two semi-automatic approaches are quite sensitive to 
user interaction (LR). 


e The approach modelling knowledge that is more robust can 
achieve better performance on image dataset with large var- 
iance in image quality. 

e The approach based on strong constraints such as prede- 
fined reference point, intensity distribution, and ROI size 
cannot handle BUS images from different sources well. 

e The quantitative metrics such as JI, DSC, AER, HE and 
MAE are more comprehensive and effective to measure the 
overall segmentation performance than TPR and FPR; how- 
ever, TPR and FPR are also useful for developing and im- 
proving algorithms. 

In addition, the benchmark should be and will be expanded 
continuously. 
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